Mini Project Activity

Crimes in Boston

created
-- by Jesy Jeff Laura E
-- for IPEC Solutions

blob.jpg

Boston, known as the "Hub of the Universe," is a city with a rich history and vibrant culture. Established in 1630, it played a significant role in the American Revolution and boasts numerous historical landmarks. Today, Boston is a thriving metropolis, home to prestigious universities like Harvard and MIT, fostering a climate of innovation and intellectual curiosity. The city's diverse neighborhoods each have their own unique charm, from the cobblestone streets of Beacon Hill to the bustling multicultural hub of Chinatown. With world-class museums, beautiful parks, passionate sports fans, and a lively arts scene, Boston offers a captivating blend of tradition and modernity, making it a captivating destination for residents and visitors alike.¶

Data Understanding¶

  1. INCIDENT_NUMBER: Unique number of each case (One case can contain multiple offences)
  2. OFFENSE_CODE: Code of Crime's Offense.
  3. OFFENSE_CODE_GROUP: Name of the offense.
  4. OFFENSE_DESCRIPTION: Description of the offense.
  5. DISTRICT: Boston Disctrict Code
  6. REPORTING_AREA: Area number where crimes occurred.
  7. SHOOTING: Indicate if there are a shooting involves.
  8. OCCURRED_ON_DATE: Time of incidents/crime.
  9. YEAR: Year of the occurred crimes.
  10. MONTH: Month of the occurred crimes.
  11. DAY OF WEEK: Day of the occurred crimes.
  12. HOUR: Hour of the occurred crimes.
  13. UCR_PART: Uniform Crime Reporting Offence types.
  14. STREET: Street name where crime occurred.
  15. LAT: The latitude where the crime happened.
  16. LONG: The longitude where the crime happened.
  17. LOCATION: Combination of Lat and Long (Lat, Long).
  18. DATE: date of incidents.
  19. AGE: age of the offense group.
  20. SEX: sex of the offense group.

Understand the Hierarchy of City - Boston¶

In Boston, the administrative hierarchy can be represented as follows:

1. Country: United States

   - The highest level of administrative division, encompassing the entire country.

2. State: Massachusetts

   - The state in which Boston is located.

3. County: Suffolk County

   - The county in which Boston is located. Suffolk County includes the city of Boston and some neighboring areas.

4. City: Boston

   - The city of Boston itself, which is the capital and largest city of Massachusetts.**

5. Neighborhoods/Districts: Boston is further divided into several neighborhoods or districts, each with its own characteristics and local governance. Some of the well-known neighborhoods in Boston include:

   A1: Downtown
   A15: Charlestown
   A7: East Boston
   B2: Roxbury
   B3: Mattapan
   C6: South Boston
   C11: Dorchester
   D4: South End
   D14: Brighton
   E5: West Roxbury
   E13: Jamaica Plain
   E18: Hyde Park

These administrative divisions outline the hierarchical structure of Boston's governance and provide a framework for managing and providing services to different areas within the city.

Phase-1
¶

Define Problem statement¶

The Crimes in Boston dataset capturing the type of incident as well as when and where it occurred, a well-known dataset in the field of machine learning and data analysis, contains measurements of different fields of crime given by BPD. This project aims to perform an in-depth exploratory data (EDA) and statistical analysis of the crime dataset to gain insights into the characteristics of these incidents that are provided by Boston Police Department to initial details surrounding an incident to which BPD officers respond and capturing the type of incident as well as when and where it occurred in and make meaningful conclusions.

Related of data collection of stated problem:

  • Define Problem Data.
  • Collect the data from assigned source.
  • Understanding of Dataset.

Create project plan and product backlog

Objective:

  1. Define project objectives:
  2. Data collection and loading:
  3. Exploratory data analysis:
  4. Statistical analysis:
  5. Hypothesis Testing:
  6. Data preparation and cleaning:
  7. Model Development:
  8. Model Validation:
  9. Documentation:
  10. Deployment and Integration:
  11. Testing and Quality Assurance:
  12. Maintenance and Monitoring:
  13. Report writing:
  14. conclusion:

To perform an in-depth of the Crimes in Boston dataset, you can follow these steps:

1. Define project objectives:

  • Clearly identify the problem to be solved or the question to be answered.
  • Establish specific, measurable, achievable, relevant, and time-bound (SMART) goals for the project.
  • Define the target audience for the project outcomes.

2. Data collection and loading:

  • Identify and prepare data sources.
  • Choose a data loading tool.
  • Configure the data loading tool.
  • Load the data.

3. Exploratory data analysis:

  • Get an overview of the data.
  • Identify the most common crimes.
  • Identify the most crime-prone areas.
  • Analyses the temporal distribution of crimes.
  • Analyses the relationship between different crime variables.

4. Statistical analysis:

  • Perform statistical tests to identify significant relationships between crime variables.
  • Develop predictive models to forecast future crime rates.

5. Hypothesis Testing:

  • Formulate hypotheses based on the research question and exploratory analysis.
  • Select appropriate statistical tests to evaluate the validity of the hypotheses.
  • Interpret p-values and confidence intervals to draw conclusions.

6. Data preparation and cleaning:

  • Import the necessary libraries.
  • Load the dataset.
  • Check for missing values and outliers.
  • Clean and prepare the data for analysis.

7. Model Development:

  • Split the data into training and testing sets.
  • Implement and train various regression models (e.g., Linear Regression, Decision Trees, Random Forest).
  • Evaluate model performance using appropriate metrics.
  • Fine-tune hyperparameters.

8. Model Validation:

  • alidate the model's performance on unseen data.
  • Cross-validation to assess model generalization.
  • Address any overfitting or underfitting issues.

9. Documentation:

  • Create documentation for the project, including the dataset, data preprocessing steps, and model details.
  • Prepare a report or presentation summarizing the findings and the model's performance.

10. Deployment and Integration:

  • Deploy the model to a relevant platform or system.
  • Integrate the model with any necessary applications or interfaces.

11. Testing and Quality Assurance:

  • Thoroughly test the deployed model to ensure it functions as expected.
  • Address any issues or bugs.

12. Maintenance and Monitoring:

  • Establish a system for monitoring model performance in production.
  • Plan for model updates and maintenance.

13. Report writing:

  • Write a report that summarizes the findings of the analysis and provides recommendations for reducing crime in Boston.

14. conclusion:

  • Summarize the key findings and insights derived from the project.
  • Discuss the implications of the results and potential applications.
  • Outline future directions for research and follow-up studies.

Create Git Repository

  1. Go to the GitHub homepage.
  2. Click on the plus icon in the top right corner of the page, and select “New repository” from the dropdown menu.
  3. On the next page, enter a name for your repository in the “Repository name” field.
  4. Optionally, add a description of your repository in the “Description” field.
  5. Choose whether you want your repository to be public or private.
  6. If you want to initialize your repository with a README, select the “Add a README file” checkbox.
  7. Click on the “Create repository” button.

My Git Repository

GitHub

Phase-2
¶

Phase-2 (Summary)

Statistical Analysis

Data Exploration and analysis for the stated problem & Given Dataset (Coding)

  1. Frame 10 questions on Probability & Statistics
  2. Dispersion for the parameters
  3. Data distribution
  4. Visualize above with Distribution, Histogram & Scatter Plots
  5. Test statistic
  6. Test type (T-test, Z-test, F-test, ANNOVA, Chi-Square, PCA)
  7. Interpreting test statistics

Read the Dataset¶

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
df.head(5)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\3190502872.py:1: DtypeWarning: Columns (0,2,3,4,5,6,7,10,12,13,16,17,19) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
Out[2]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 2403.0 Disorderly Conduct DISTURBING THE PEACE E18 495 NaN 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 3201.0 Property Lost PROPERTY - LOST D14 795 NaN 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
2 I182080052 2647.0 Other THREATS TO DO BODILY HARM B2 329 NaN 03-10-2018 19.20 2018.0 10.0 Wednesday 19.0 Part Two DEVON ST 42.308126 -71.076930 (42.30812619, -71.07692974) 03-10-2018 24.0 female
3 I182080051 413.0 Aggravated Assault ASSAULT - AGGRAVATED - BATTERY A1 92 NaN 03-10-2018 20.00 2018.0 10.0 Wednesday 20.0 Part One CAMBRIDGE ST 42.359454 -71.059648 (42.35945371, -71.05964817) 03-10-2018 56.0 female
4 I182080050 3122.0 Aircraft AIRCRAFT INCIDENTS A7 36 NaN 03-10-2018 20.49 2018.0 10.0 Wednesday 20.0 Part Three PRESCOTT ST 42.375258 -71.024663 (42.37525782, -71.02466343) 03-10-2018 57.0 male

Statistical Analysis¶

Data Exploration and analysis for the stated problem & Given Dataset.¶

A. Frame 10 questions on Probability & Statistics¶

Probability¶

  1. What is the probability of a crime occurring in a given neighborhood on a given day?
  2. What is the probability of a certain type of crime being committed (e.g., robbery, burglary, etc.)?
  3. What is the probability of a crime being shooted?
  4. What is the probability of a crime occurring in Boston on a weekday?
  5. What is the probability of a crime occurring in Boston in each year?
  6. What is the probability that a crime committed in Boston occurs on a weekend?

1. What is the probability of a crime occurring in a given neighborhood on a given day?¶

In [3]:
import pandas as pd
import numpy as np

# Calculate the total number of crimes that occurred in each neighborhood
neighborhood_crime_counts = df['DISTRICT'].value_counts()

# Calculate the probability of a crime occurring in a given neighborhood on a given day
neighborhood_crime_probabilities = neighborhood_crime_counts / df.shape[0]

# Print the probability of a crime occurring in the neighborhood with the highest crime rate
print(neighborhood_crime_probabilities.max())
0.09758455025448319

2. What is the probability of a certain type of crime being committed (e.g., robbery, burglary, etc.)?¶

In [4]:
# Calculate the total number of crimes of each type
crime_type_counts = df['OFFENSE_CODE_GROUP'].value_counts()

# Calculate the probability of a certain type of crime being committed
crime_type_probabilities = crime_type_counts / df.shape[0]

# Print the probability of the most common type of crime being committed
print(crime_type_probabilities.max())
0.07255672358845075

3.What is the probability of a crime being Shooted?¶

In [5]:
# Calculate the number of crimes that were solved
df1 = df[df['SHOOTING'] == True].shape[0]

# Calculate the probability of a crime being solved
crime_shooting_probability = df1 / df.shape[0]

# Print the probability of a crime being solved
print(crime_shooting_probability)
0.0

4.What is the probability of a crime occurring in Boston on a weekday?¶

In [ ]:
# Convert to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])

# Create new column for day of week
df['DAY_OF_WEEK'] = df['DATE'].dt.day_name()

# Calculate total number of crimes
total_crimes = len(df)

# Calculate total number of crimes that occurred on weekdays
weekday_crimes = len(df[df['DAY_OF_WEEK'].isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])])

# Calculate probability of crimes occurring on weekdays
prob_weekday_crime = weekday_crimes / total_crimes

print(f"Probability of crimes occurring on weekdays: {prob_weekday_crime:.2f}")

5.What is the probability of a crime occurring in Boston in each year?¶

In [ ]:
# Convert to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])

# Create new column for year
df['YEAR'] = df['DATE'].dt.year

# Calculate total number of crimes
total_crimes = len(df)

# Calculate total number of crimes that occurred in each year
crimes_by_year = df.groupby('YEAR').size().reset_index(name='COUNT')

# Calculate probability of crimes occurring in each year
crimes_by_year['PROBABILITY'] = crimes_by_year['COUNT'] / total_crimes

print(crimes_by_year[['YEAR', 'PROBABILITY']])

6.What is the probability that a crime committed in Boston occurs on a weekend?¶

In [ ]:
# Convert the 'OCCURRED_ON_DATE' column to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])

# Filter out crimes that occurred on weekends
df_weekend = df[df['DATE'].dt.dayofweek.isin([5, 6])]

# Calculate the probability of a crime being committed on a weekend
prob_weekend = len(df_weekend) / len(df)

print(f'The probability of a crime being committed on a weekend in Boston is {prob_weekend:.2%}.')

Statistics¶

  1. What is the average crime hour in Boston?
  2. What is the median crime hour in Boston?
  3. Which district has reported the highest number of crimes in the dataset?
  4. What is the least common type of crime in Boston?
  5. What is the most common type of crime in Boston?
  6. Print Summarizes of the dataset.

1. What is the average crime hour in Boston?¶

In [6]:
# Calculate the average crime hour in Boston
average_crime_hour = df['HOUR'].mean()

# Print the average crime hour in Boston
print(average_crime_hour)
13.114840461228724

2.What is the median crime hour in Boston?¶

In [7]:
# Calculate the median crime hour in Boston
median_crime_rate = df['HOUR'].median()

# Print the median crime hour in Boston
print(median_crime_rate)
14.0

3.Which district has reported the highest number of crimes in the dataset?¶

In [8]:
# Group the data by district and count the number of offenses
offenses_by_district = df.groupby('DISTRICT')['INCIDENT_NUMBER'].count()

# Find the district with the highest number of offenses
district_with_most_offenses = offenses_by_district.idxmax()

print(f"The district with the highest number of offenses is {district_with_most_offenses}")
The district with the highest number of offenses is B2

4.What is the least common type of crime in Boston?¶

In [9]:
# Count the number of occurrences of each crime type
crime_counts = df['OFFENSE_DESCRIPTION'].value_counts()

# Select the least common type of crime in Boston
least_common_crime = crime_counts.index[-1]

print(f'The least common type of crime in Boston is {least_common_crime}.')
The least common type of crime in Boston is DRUGS - POSS CLASS D - INTENT MFR DIST DISP.

5.What is the most common type of crime in Boston?¶

In [10]:
# Count the number of occurrences of each crime type
crime_counts = df['OFFENSE_DESCRIPTION'].value_counts()

# Select the least common type of crime in Boston
most_common_crime = crime_counts.index[1]

print(f'The most common type of crime in Boston is {most_common_crime}.')
The most common type of crime in Boston is INVESTIGATE PERSON.

6. Print Summarizes of the dataset¶

In [11]:
df.describe()
Out[11]:
OFFENSE_CODE YEAR MONTH HOUR Lat Long AGE
count 327820.000000 327820.000000 327820.000000 327820.000000 307188.000000 307188.000000 56911.000000
mean 2317.961171 2016.598676 6.672213 13.114840 42.212995 -70.906030 39.059338
std 1184.990073 1.009775 3.253984 6.292714 2.173496 3.515832 12.400959
min 111.000000 2015.000000 1.000000 0.000000 -1.000000 -71.178674 18.000000
25% 1001.000000 2016.000000 4.000000 9.000000 42.297466 -71.097081 28.000000
50% 2907.000000 2017.000000 7.000000 14.000000 42.325552 -71.077493 39.000000
75% 3201.000000 2017.000000 9.000000 18.000000 42.348624 -71.062482 50.000000
max 3831.000000 2018.000000 12.000000 23.000000 42.395042 -1.000000 86.000000

B.Dispersion for the parameters¶

Dispersion of the data used to understands the distribution of data. It helps to understand the variation of data and provides a piece of information about the distribution data. Range, IQR, variance and standard deviations are the methods used to understand the distribution data.

  • Standard Deviation
  • Variance
  • Range
  • Interquartile Range

i. Standard Deviation¶

What is the standard deviation of the crime hour in Boston?¶

In [12]:
# Calculate the standard deviation of the crime hour in Boston
crime_rate_standard_deviation = df['HOUR'].std()

# Print the standard deviation of the crime hour in Boston
print(crime_rate_standard_deviation)
6.292714255219991

ii. Variance¶

What is the variance of the criminal age in Boston?¶

In [13]:
# Calculate the variance of the 'Hour' column
variance = np.var(df['AGE'])

print(f'The variance of the criminal Age in Boston is {variance:.2f}.')
The variance of the criminal Age in Boston is 153.78.

iii. Range¶

What is the range of Latitude and Longitude of the crime in Boston?¶

In [14]:
print("Range of latitude is :",df['Lat'].max()-df['Lat'].min())
print("Range of longitude is :",df['Long'].max()-df['Long'].min())
Range of latitude is : 43.39504158
Range of longitude is : 70.17867378

iv. Interquartile Range¶

What is the Interquartile Range of the criminal age in boston?¶

In [15]:
# Calculate the IQR of the 'Hour' column
Q1 = df['AGE'].quantile(0.25)
Q3 = df['AGE'].quantile(0.75)
IQR = Q3 - Q1

print(f'The interquartile range (IQR) of the criminal age in Boston is {IQR:.2f}.')
The interquartile range (IQR) of the criminal age in Boston is 22.00.

C.Data Distribution¶

1.What is Data Distribution ?¶

Researchers that collect data during studies often find themselves with large sets of data that they need to simplify in order for them to communicate their findings to different audiences. To do this, they often use what is called a data distribution. A data distribution is a graphical representation of data that was collected from a sample or population. It is used to organize and disseminate large amounts of information in a way that is meaningful and simple for audiences to digest.

2.What are the different types of data distribution?¶

There are two types of data distribution based on two different kinds of data: Discrete and Continuous. Discrete data distributions include binomial distributions, Poisson distributions, and geometric distributions. Continuous data distributions include normal distributions and the Student's t-distribution.

3.How do you find the distribution of data?¶

A probability plot is used to determine the distribution of data. It is a test that graphs data points along a straight line. Data that fit along that line qualify as that given type of distribution.

Visualize above with Distribution, Histogram & Scatter Plots¶

i. Histogram¶

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
df.plot(kind='hist')
Out[16]:
<Axes: ylabel='Frequency'>
In [17]:
# Create a histogram plot for the crime data
plt.hist(df['OFFENSE_CODE'], bins=50)
plt.xlabel('Offense Code')
plt.ylabel('Frequency')
plt.title('Distribution of Crime Data in Boston')
plt.show()

ii.Scatter Plot¶

In [ ]:
# Convert the 'Date' column to a datetime object
df['DATE'] = pd.to_datetime(df['DATE'])

# Group the data by month and count the number of crimes
monthly_crime_counts = df.groupby(pd.Grouper(key='DATE', freq='M')).size()

# Create a scatter plot of the monthly crime counts
plt.scatter(monthly_crime_counts.index, monthly_crime_counts.values)

# Set the title and axis labels
plt.title('Monthly Crime Counts in Boston')
plt.xlabel('Month')
plt.ylabel('Number of Crimes')

# Display the plot
plt.show()
In [ ]:
import seaborn as sns
sns.scatterplot(data=df)

iii Distribution Plot¶

In [19]:
import seaborn as sns
sns.distplot(df['OFFENSE_CODE'], kde = False, color ='red', bins = 30)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1320495804.py:2: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(df['OFFENSE_CODE'], kde = False, color ='red', bins = 30)
Out[19]:
<Axes: xlabel='OFFENSE_CODE'>

D.Test statistic¶

A test statistic describes how closely the distribution of your data matches the distribution predicted under the null hypothesis of the statistical test you are using.

The distribution of data is how often each observation occurs, and can be described by its central tendency and variation around that central tendency. Different statistical tests predict different types of distributions, so it’s important to choose the right statistical test for your hypothesis.

The test statistic summarizes your observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in your statistical model.

Formulas for Test Statistics¶

Test Statistic Formula Finding
T-value for 1-sample t-test image.png Take the sample mean, subtract the hypothesized mean, and divide by the standard error of the mean.
T-value for 2-sample t-test image-2.png Take one sample mean, subtract the other, and divide by the pooled standard deviation.
F-value for F-tests and ANOVA image-3.png Calculate the ratio of two variances.
Chi-squared value (χ2) for a Chi-squared test image-4.png Sum the squared differences between observed and expected values divided by the expected values.

Test type (T-test, Z-test, F-test, ANNOVA, Chi-Square, PCA)¶

T-test¶

What is T-Test?¶

A t-test is a statistical hypothesis test that is used to determine whether there is a significant difference between the means of two groups. It helps you assess whether any observed differences between the groups are likely to have occurred by chance or if they are statistically significant.

Types of T-tests (With Solved Examples in R)

There are three types of t-tests we can perform based on the data at hand:

  • One sample t-test
  • Independent two-sample t-test
  • Paired sample t-test

T test formula : The formula for the two-sample t test (a.k.a. the Student’s t-test) is shown below.

image.png

In [20]:
from scipy.stats import ttest_ind

# Split the dataset into two groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']

# Perform the t-test
t, p = ttest_ind(group1['OFFENSE_CODE'], group2['OFFENSE_CODE'])

# Print the results
print('t-value:', t)
print('p-value:', p)
t-value: 0.539141126234205
p-value: 0.5897911489474033

Z-test¶

What is Z-test?¶

A Z-test is a statistical hypothesis test that is used to determine whether there is a significant difference between the sample mean and a known population mean when the population standard deviation is known. It is particularly useful when dealing with large sample sizes and normally distributed data. Z-tests are a parametric test, which means they make certain assumptions about the data, such as normality and known population standard deviation.

The formula for a Z-test statistic is:

image-2.png

In [21]:
from scipy.stats import norm

# Split the dataset into two groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']

# Calculate the mean and standard deviation of each group
mean1, std1 = group1['OFFENSE_CODE'].mean(), group1['OFFENSE_CODE'].std()
mean2, std2 = group2['OFFENSE_CODE'].mean(), group2['OFFENSE_CODE'].std()

# Calculate the standard error of the difference between means
se = ((std1 ** 2) / len(group1) + (std2 ** 2) / len(group2)) ** 0.5
z = (mean1 - mean2) / se

# Calculate the p-value
p = 1 - norm.cdf(abs(z))

# Print the results
print('z-value:', z)
print('p-value:', p)
z-value: 0.5323332640349332
p-value: 0.2972475987305382

ANOVA¶

ANOVA stands for Analysis of Variance, and it is a statistical technique used to analyze the differences among group means in a sample. ANOVA is especially useful when you want to compare the means of three or more groups or treatments to determine if there are significant differences between them. It helps in assessing whether the variation between group means is greater than what would be expected by random chance.

Formula for One-way ANOVA:

The formula for one-way ANOVA involves calculating the F-statistic, which follows an F-distribution. Here's the basic formula:

image.png

In [22]:
from scipy.stats import f_oneway

# Split the dataset into three groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']
group3 = df[df['DISTRICT'] == 'A1']

# Perform the F-test
f, p = f_oneway(group1['OFFENSE_CODE'], group2['OFFENSE_CODE'], group3['OFFENSE_CODE'])

# Print the results
print('F-value:', f)
print('p-value:', p)
F-value: 482.9052727105152
p-value: 1.5978452466819635e-209

Chi-Square¶

Chi-Square, often denoted as χ² (chi-squared), is a statistical test used to determine if there is a significant association or relationship between two categorical variables in a contingency table. It is a non-parametric test, meaning it does not rely on any assumptions about the distribution of the data, making it suitable for categorical data analysis.

The Chi-Square test formula is used to calculate the Chi-Square statistic, which quantifies the difference between the observed and expected frequencies in a contingency table. Here's the formula for the Chi-Square statistic in the context of a 2x2 contingency table:

image.png

In [23]:
from scipy.stats import chi2_contingency

# Create a contingency table for crime rates by neighborhood
crime_table = df.groupby('DISTRICT')['OFFENSE_CODE'].value_counts()

# Perform the Chi-Square test
chi2_statistic, p_value, degrees_of_freedom, expected_counts = chi2_contingency(crime_table)

# Print the results
print('Chi-Square statistic:', chi2_statistic)
print('p-value:', p_value)
print('Degrees of freedom:', degrees_of_freedom)
print('Expected counts:', expected_counts)
Chi-Square statistic: 0.0
p-value: 1.0
Degrees of freedom: 0
Expected counts: [2.220e+03 1.917e+03 1.912e+03 ... 1.000e+00 1.000e+00 1.000e+00]

PCA¶

Principal Component Analysis (PCA) is a widely used technique in statistics and data science for dimensionality reduction and data visualization. Its importance lies in several key applications and benefits:

  1. Dimensionality Reduction: PCA is primarily used for reducing the dimensionality of large datasets while retaining as much of the original variability as possible. By transforming the data into a new set of variables (principal components), it eliminates redundant or less important information, making complex data more manageable and easier to analyze.

  2. Data Visualization: PCA is a powerful tool for visualizing data in lower-dimensional spaces. It helps to project high-dimensional data onto a lower-dimensional subspace, making it possible to represent data in two or three dimensions, which can be easily visualized in scatter plots or other graphical forms.

  3. Noise Reduction: In many datasets, there is noise or irrelevant information that can make analysis challenging. PCA can help remove this noise by focusing on the principal components that capture the most significant variation in the data.

  4. Pattern Recognition and Clustering: PCA can be used as a preprocessing step for pattern recognition and clustering algorithms. It can help improve the performance of these techniques by reducing the feature space while retaining essential information.

  5. Feature Engineering: In machine learning, feature engineering is a crucial step in model development. PCA can be used to create new features or reduce the dimensionality of feature sets, leading to more efficient and accurate models.

  6. Multicollinearity Mitigation: In regression analysis, multicollinearity (high correlation among predictor variables) can lead to unstable coefficient estimates. PCA can address this issue by transforming the correlated predictors into orthogonal (uncorrelated) principal components.

  7. Anomaly Detection: PCA can be used for anomaly or outlier detection by examining data points that deviate significantly from the expected pattern in the lower-dimensional subspace.

  8. Compression: In data storage and transmission, PCA can be used to compress data while retaining critical information. This is particularly useful in scenarios where storage or bandwidth is limited.

  9. Eigenvector and Eigenvalue Analysis: PCA is built on the mathematical concepts of eigenvectors and eigenvalues, which have applications beyond PCA, including physics, engineering, and quantum mechanics.

  10. Interpretability: PCA often leads to more interpretable and understandable representations of data. The principal components can be analyzed to understand which original variables or features contribute the most to the variance.

  11. Machine Learning: PCA can be integrated into machine learning pipelines as a preprocessing step to improve model performance, reduce overfitting, and speed up training.

In [24]:
# Impute missing values in the age column
df['OFFENSE_CODE'] = df['OFFENSE_CODE'].fillna(df['OFFENSE_CODE'].mean())
df.head()
Out[24]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 2403.0 Disorderly Conduct DISTURBING THE PEACE E18 495 NaN 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 3201.0 Property Lost PROPERTY - LOST D14 795 NaN 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
2 I182080052 2647.0 Other THREATS TO DO BODILY HARM B2 329 NaN 03-10-2018 19.20 2018.0 10.0 Wednesday 19.0 Part Two DEVON ST 42.308126 -71.076930 (42.30812619, -71.07692974) 03-10-2018 24.0 female
3 I182080051 413.0 Aggravated Assault ASSAULT - AGGRAVATED - BATTERY A1 92 NaN 03-10-2018 20.00 2018.0 10.0 Wednesday 20.0 Part One CAMBRIDGE ST 42.359454 -71.059648 (42.35945371, -71.05964817) 03-10-2018 56.0 female
4 I182080050 3122.0 Aircraft AIRCRAFT INCIDENTS A7 36 NaN 03-10-2018 20.49 2018.0 10.0 Wednesday 20.0 Part Three PRESCOTT ST 42.375258 -71.024663 (42.37525782, -71.02466343) 03-10-2018 57.0 male
In [25]:
# Impute missing values in the age column
df['AGE'] = df['AGE'].fillna(df['AGE'].mean())
df.head()
Out[25]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 2403.0 Disorderly Conduct DISTURBING THE PEACE E18 495 NaN 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 3201.0 Property Lost PROPERTY - LOST D14 795 NaN 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
2 I182080052 2647.0 Other THREATS TO DO BODILY HARM B2 329 NaN 03-10-2018 19.20 2018.0 10.0 Wednesday 19.0 Part Two DEVON ST 42.308126 -71.076930 (42.30812619, -71.07692974) 03-10-2018 24.0 female
3 I182080051 413.0 Aggravated Assault ASSAULT - AGGRAVATED - BATTERY A1 92 NaN 03-10-2018 20.00 2018.0 10.0 Wednesday 20.0 Part One CAMBRIDGE ST 42.359454 -71.059648 (42.35945371, -71.05964817) 03-10-2018 56.0 female
4 I182080050 3122.0 Aircraft AIRCRAFT INCIDENTS A7 36 NaN 03-10-2018 20.49 2018.0 10.0 Wednesday 20.0 Part Three PRESCOTT ST 42.375258 -71.024663 (42.37525782, -71.02466343) 03-10-2018 57.0 male
In [26]:
from sklearn.decomposition import PCA
import numpy as np
n=['OFFENSE_CODE','AGE']
data_stats=df[n]
pca = PCA(n_components=0.9) 
principal_components = pca.fit_transform(data_stats)
num_components = pca.n_components_
num_components
Out[26]:
1

Phase-3¶

Exploratory Data Analysis¶

Data Exploration and analysis for the stated problem & Given Dataset¶

Libraries¶

In [ ]:
!pip install plotly
In [27]:
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stat
import seaborn as sns
import pandas as pd
import numpy as np
import calendar
import json

1.What are the most common crimes group in terms of offense type?¶

In [28]:
crime = df['OFFENSE_CODE_GROUP'].value_counts().index[:10]
crime_count = df['OFFENSE_CODE_GROUP'].value_counts().values[:10]

plt.figure(figsize=(12,8))
ax = sns.barplot(y = crime , x = crime_count, orient='h', palette='Reds_r')
plt.xlabel(xlabel='OFFENSE_NAME')
plt.ylabel(ylabel='OFFENSE_GROUP')
plt.title("OFFENSE_DESCRIPTION", fontdict = {'size' : 'xx-large', 'fontweight' : 'bold'})
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\248667759.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(y = crime , x = crime_count, orient='h', palette='Reds_r')

2.Under which UCR part most number of crimes committed ?¶

In [29]:
labels = df['UCR_PART'].astype('category').cat.categories.tolist()
counts = df['UCR_PART'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots(figsize = (22,12))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=140, textprops={'color':"black", 'size' : 'x-large', 'fontweight' : 'bold'}) 
ax1.axis('equal')
plt.show()

3.How crimes are distributed across districts?¶

In [30]:
order = df['OFFENSE_CODE_GROUP'].value_counts().head(15).index
plt.figure(figsize = (30,10))
sns.countplot(df, x='OFFENSE_CODE_GROUP',hue=df.DISTRICT, order = order ,palette="cubehelix");
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks(rotation=90)
plt.show()

4.How many Crimes in Boston by Year?¶

In [31]:
df['YEAR'] = df['DATE'].str[-4:]

yeared=df.groupby("YEAR").size()
yeared.plot(kind="line",color="green",linewidth=4)
plt.title("Crime in Boston by Year")
plt.ylabel("Number of Crimes")
plt.show()

5.How many number of cases count in Year?¶

In [32]:
fig = px.histogram(df, x=['YEAR'], template='plotly_white', 
                opacity=0.7,log_y=True, labels={'x':'YEARS', 'y':'Case Number Count'} )
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False)
fig.show()

6.Count the UCR_PART in each District?¶

In [33]:
plt.figure(figsize=(12, 6))
sns.barplot(x='UCR_PART', y='DISTRICT', data=df, palette='Set1')
plt.title('UCR_PART')
plt.xlabel('DISTRICT')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\698744356.py:2: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


7.Count the crimes done in each year,month,week,hour?¶

In [ ]:
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)
sns.countplot(df['YEAR'], ax=ax[0])
sns.countplot(df['MONTH'], ax=ax[1])
sns.countplot(df['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(df['HOUR'], ax=ax[3])
fig.show()

8.Pair Plot¶

In [ ]:
sns.pairplot(df)

9.Violin Chart¶

In [34]:
import plotly.express as pt 
import pandas as pd 
  
fig = pt.violin(df, y="YEAR") 
fig.show()

10.HeatMap¶

In [35]:
fig, ax = plt.subplots(figsize=(10, 6))

week_and_hour = df.groupby(['HOUR', 'DAY_OF_WEEK']).count()['OFFENSE_CODE_GROUP'].unstack()

week_and_hour.columns = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

heatmap = sns.heatmap(week_and_hour, cmap=sns.cubehelix_palette(as_cmap=True), ax=ax)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)

plt.xlabel('')
plt.ylabel('Hour')

plt.show()

11.Print crime group by using catplot?¶

In [36]:
sns.catplot(y='OFFENSE_CODE_GROUP',
            kind='count',
            height=11, 
            aspect=2,
            order=df.OFFENSE_CODE_GROUP.value_counts().index,
            data=df)
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x1e3dfdbeb60>

12.Scatter Plot¶

In [37]:
import matplotlib as mpl 
df.Lat.replace(-1, None, inplace=True)
df.Long.replace(-1, None, inplace=True)

mpl.rcParams["figure.figsize"] = 21,11

plt.subplots(figsize=(11,6))
sns.scatterplot(x='Lat', y='Long', alpha=0.1, data=df)
plt.legend(loc=2)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Out[37]:
<matplotlib.legend.Legend at 0x1e3fc0edba0>

13.The most frequent 15 crimes among to Boston's Neighborhoods¶

In [38]:
order = df['OFFENSE_CODE_GROUP'].value_counts().head(15).index
g = sns.FacetGrid(data = df, hue = "MONTH", height = 5)
g.map(sns.kdeplot, "OFFENSE_CODE", shade = True)
g.add_legend()
plt.figure(figsize = (30,10))
sns.countplot(data = df, x='OFFENSE_CODE_GROUP', order = order ,palette="cubehelix");
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks(rotation=90)
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1189323747.py:6: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

14.FacetGrid¶

In [39]:
import seaborn as sns
import pandas as pd

# Define the data
data = pd.DataFrame({'Group': ['A', 'B', 'C', 'D'] * 3,
                     'MONTH': ['Jan', 'Feb', 'Mar'] * 4,
                     'Value': [1, 3, 2, 5, 6, 8, 9, 12, 11, 14, 13, 15]})

# Create the FacetGrid object
g = sns.FacetGrid(data=data, hue="MONTH", height=5)

# Plot the kdeplot
g.map(sns.kdeplot, "Value", shade=True).add_legend();
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


15.GIS (Geographic Information System)¶

In [ ]:
!pip install folium
In [40]:
import folium
from folium.plugins import HeatMap
B2_district=df.loc[df.DISTRICT=='B2'][['Lat','Long']]
B2_district.Lat.fillna(0, inplace = True)
B2_district.Long.fillna(0, inplace = True) 

map_1=folium.Map(location=[42.356145,-71.064083], 
                 tiles = "OpenStreetMap",
                zoom_start=11)

folium.CircleMarker([42.319945,-71.079989],
                        radius=70,
                        fill_color="#b22222",
                        popup='Homicide',
                        color='red',
                       ).add_to(map_1)


HeatMap(data=B2_district, radius=16).add_to(map_1)

map_1
Out[40]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Phase-4¶

Import Libraries¶

In [41]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

1.Read the Dataset¶

In [42]:
df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
df.head(5)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\3190502872.py:1: DtypeWarning:

Columns (0,2,3,4,5,6,7,10,12,13,16,17,19) have mixed types. Specify dtype option on import or set low_memory=False.

Out[42]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 2403.0 Disorderly Conduct DISTURBING THE PEACE E18 495 NaN 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 3201.0 Property Lost PROPERTY - LOST D14 795 NaN 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
2 I182080052 2647.0 Other THREATS TO DO BODILY HARM B2 329 NaN 03-10-2018 19.20 2018.0 10.0 Wednesday 19.0 Part Two DEVON ST 42.308126 -71.076930 (42.30812619, -71.07692974) 03-10-2018 24.0 female
3 I182080051 413.0 Aggravated Assault ASSAULT - AGGRAVATED - BATTERY A1 92 NaN 03-10-2018 20.00 2018.0 10.0 Wednesday 20.0 Part One CAMBRIDGE ST 42.359454 -71.059648 (42.35945371, -71.05964817) 03-10-2018 56.0 female
4 I182080050 3122.0 Aircraft AIRCRAFT INCIDENTS A7 36 NaN 03-10-2018 20.49 2018.0 10.0 Wednesday 20.0 Part Three PRESCOTT ST 42.375258 -71.024663 (42.37525782, -71.02466343) 03-10-2018 57.0 male
In [43]:
df =df.drop('OFFENSE_CODE', axis=1)
df.head(2)
Out[43]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 Disorderly Conduct DISTURBING THE PEACE E18 495 NaN 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 Property Lost PROPERTY - LOST D14 795 NaN 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
In [44]:
df =df.drop('SHOOTING', axis=1)
df.head(2)
Out[44]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 I182080058 Disorderly Conduct DISTURBING THE PEACE E18 495 03-10-2018 20.13 2018.0 10.0 Wednesday 20.0 Part Two ARLINGTON ST 42.262608 -71.121186 (42.26260773, -71.12118637) 03-10-2018 23.0 male
1 I182080053 Property Lost PROPERTY - LOST D14 795 30-08-2018 20.00 2018.0 8.0 Thursday 20.0 Part Three ALLSTON ST 42.352111 -71.135311 (42.35211146, -71.13531147) 30-08-2018 18.0 female
In [45]:
df.shape
Out[45]:
(525575, 18)
In [46]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525575 entries, 0 to 525574
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   INCIDENT_NUMBER      327820 non-null  object 
 1   OFFENSE_CODE_GROUP   327820 non-null  object 
 2   OFFENSE_DESCRIPTION  327820 non-null  object 
 3   DISTRICT             326046 non-null  object 
 4   REPORTING_AREA       327820 non-null  object 
 5   OCCURRED_ON_DATE     327820 non-null  object 
 6   YEAR                 327820 non-null  float64
 7   MONTH                327820 non-null  float64
 8   DAY_OF_WEEK          327820 non-null  object 
 9   HOUR                 327820 non-null  float64
 10  UCR_PART             327727 non-null  object 
 11  STREET               316843 non-null  object 
 12  Lat                  307188 non-null  float64
 13  Long                 307188 non-null  float64
 14  Location             327820 non-null  object 
 15  DATE                 327820 non-null  object 
 16  AGE                  56911 non-null   float64
 17  Sex                  891 non-null     object 
dtypes: float64(6), object(12)
memory usage: 72.2+ MB

2.Feature Engineering¶

2.1Data Preprocessing¶

- Checkout for missing values and handle them¶

In [47]:
df.isnull().sum()
Out[47]:
INCIDENT_NUMBER        197755
OFFENSE_CODE_GROUP     197755
OFFENSE_DESCRIPTION    197755
DISTRICT               199529
REPORTING_AREA         197755
OCCURRED_ON_DATE       197755
YEAR                   197755
MONTH                  197755
DAY_OF_WEEK            197755
HOUR                   197755
UCR_PART               197848
STREET                 208732
Lat                    218387
Long                   218387
Location               197755
DATE                   197755
AGE                    468664
Sex                    524684
dtype: int64
In [48]:
df=df.fillna(method='bfill')
df.isnull().sum()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1456028691.py:1: FutureWarning:

DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

Out[48]:
INCIDENT_NUMBER        197755
OFFENSE_CODE_GROUP     197755
OFFENSE_DESCRIPTION    197755
DISTRICT               197755
REPORTING_AREA         197755
OCCURRED_ON_DATE       197755
YEAR                   197755
MONTH                  197755
DAY_OF_WEEK            197755
HOUR                   197755
UCR_PART               197755
STREET                 197755
Lat                    197755
Long                   197755
Location               197755
DATE                   197755
AGE                    192578
Sex                    524684
dtype: int64
In [49]:
df=df.fillna(method='ffill')
df.isnull().sum()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\4175291397.py:1: FutureWarning:

DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

Out[49]:
INCIDENT_NUMBER        0
OFFENSE_CODE_GROUP     0
OFFENSE_DESCRIPTION    0
DISTRICT               0
REPORTING_AREA         0
OCCURRED_ON_DATE       0
YEAR                   0
MONTH                  0
DAY_OF_WEEK            0
HOUR                   0
UCR_PART               0
STREET                 0
Lat                    0
Long                   0
Location               0
DATE                   0
AGE                    0
Sex                    0
dtype: int64

- Checkout for outliers and handle them¶

In [50]:
df.plot(kind='box')
plt.xticks(rotation=90)
plt.show()

- Removing outliers using IQR method¶

In [51]:
#Removing outliers 
#find iqr
q1=df['MONTH'].quantile(0.25)
q3=df['MONTH'].quantile(0.75)
print("Q1=",q1)
print("Q3=",q3)
iqr=q3-q1
print("iqr=",(q3-q1))
# calculate upperlimit and lowerlimit
lower=q1-1.5*iqr
upper=q3+1.5*iqr
print("Lower limits= ",lower)
print("upper limits= ",upper)
df[df['MONTH'] > upper]
df[df['MONTH'] < lower]
df=df[df['MONTH'] < upper]
df['MONTH'].describe()
Q1= 6.0
Q3= 8.0
iqr= 2.0
Lower limits=  3.0
upper limits=  11.0
Out[51]:
count    478406.000000
mean          5.918555
std           2.133930
min           1.000000
25%           5.000000
50%           6.000000
75%           7.000000
max          10.000000
Name: MONTH, dtype: float64
In [52]:
df['MONTH'].plot(kind='box')
plt.show()
In [53]:
df.shape
Out[53]:
(478406, 18)
In [54]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 478406 entries, 0 to 525574
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   INCIDENT_NUMBER      478406 non-null  object 
 1   OFFENSE_CODE_GROUP   478406 non-null  object 
 2   OFFENSE_DESCRIPTION  478406 non-null  object 
 3   DISTRICT             478406 non-null  object 
 4   REPORTING_AREA       478406 non-null  object 
 5   OCCURRED_ON_DATE     478406 non-null  object 
 6   YEAR                 478406 non-null  float64
 7   MONTH                478406 non-null  float64
 8   DAY_OF_WEEK          478406 non-null  object 
 9   HOUR                 478406 non-null  float64
 10  UCR_PART             478406 non-null  object 
 11  STREET               478406 non-null  object 
 12  Lat                  478406 non-null  float64
 13  Long                 478406 non-null  float64
 14  Location             478406 non-null  object 
 15  DATE                 478406 non-null  object 
 16  AGE                  478406 non-null  float64
 17  Sex                  478406 non-null  object 
dtypes: float64(6), object(12)
memory usage: 69.3+ MB

Label Encoder¶

In [55]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['INCIDENT_NUMBER']=le.fit_transform(df['INCIDENT_NUMBER'])
df['OFFENSE_CODE_GROUP']=le.fit_transform(df['OFFENSE_CODE_GROUP'])
df['DAY_OF_WEEK']=le.fit_transform(df['DAY_OF_WEEK'])
df['OFFENSE_DESCRIPTION']=le.fit_transform(df['OFFENSE_DESCRIPTION'])
df['DISTRICT']=le.fit_transform(df['DISTRICT'])
df['REPORTING_AREA']=le.fit_transform(df['REPORTING_AREA'])
df['OCCURRED_ON_DATE']=le.fit_transform(df['OCCURRED_ON_DATE'])
df['UCR_PART']=le.fit_transform(df['UCR_PART'])
df['STREET']=le.fit_transform(df['STREET'])
df['Location']=le.fit_transform(df['Location'])
df['DATE']=le.fit_transform(df['DATE'])
df['Sex']=le.fit_transform(df['Sex'])
df
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:6: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:10: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:11: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:12: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:13: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:14: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[55]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location DATE AGE Sex
0 248345 14 56 10 439 20428 2018.0 10.0 6 20.0 3 222 42.262608 -71.121186 874 101 23.0 1
1 248344 52 175 7 770 198500 2018.0 8.0 4 20.0 2 129 42.352111 -71.135311 14291 996 18.0 0
2 248343 46 209 3 256 20419 2018.0 10.0 6 19.0 3 1222 42.308126 -71.076930 6815 101 24.0 0
3 248342 0 16 0 835 20427 2018.0 10.0 6 20.0 1 695 42.359454 -71.059648 15495 101 56.0 0
4 248341 1 4 2 290 20430 2018.0 10.0 6 20.0 2 3297 42.375258 -71.024663 16736 101 57.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
525570 0 66 226 8 817 142503 2015.0 6.0 1 0.0 2 4275 42.333839 -71.080290 10948 718 47.0 1
525571 0 66 226 8 817 142503 2015.0 6.0 1 0.0 2 4275 42.333839 -71.080290 10948 718 47.0 1
525572 0 66 226 8 817 142503 2015.0 6.0 1 0.0 2 4275 42.333839 -71.080290 10948 718 47.0 1
525573 0 66 226 8 817 142503 2015.0 6.0 1 0.0 2 4275 42.333839 -71.080290 10948 718 47.0 1
525574 0 66 226 8 817 142503 2015.0 6.0 1 0.0 2 4275 42.333839 -71.080290 10948 718 47.0 1

478406 rows × 18 columns

In [99]:
df['REPORTING_AREA'].value_counts()
Out[99]:
REPORTING_AREA
817    198266
0       18273
16       2076
98       1760
256      1682
        ...  
663         8
103         4
715         2
133         1
864         1
Name: count, Length: 880, dtype: int64
In [57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 478406 entries, 0 to 525574
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   INCIDENT_NUMBER      478406 non-null  int32  
 1   OFFENSE_CODE_GROUP   478406 non-null  int32  
 2   OFFENSE_DESCRIPTION  478406 non-null  int32  
 3   DISTRICT             478406 non-null  int32  
 4   REPORTING_AREA       478406 non-null  int32  
 5   OCCURRED_ON_DATE     478406 non-null  int32  
 6   YEAR                 478406 non-null  float64
 7   MONTH                478406 non-null  float64
 8   DAY_OF_WEEK          478406 non-null  int32  
 9   HOUR                 478406 non-null  float64
 10  UCR_PART             478406 non-null  int32  
 11  STREET               478406 non-null  int32  
 12  Lat                  478406 non-null  float64
 13  Long                 478406 non-null  float64
 14  Location             478406 non-null  int32  
 15  DATE                 478406 non-null  int32  
 16  AGE                  478406 non-null  float64
 17  Sex                  478406 non-null  int32  
dtypes: float64(6), int32(12)
memory usage: 47.4 MB

Vertical split¶

In [58]:
# vertical split
x = df.drop('REPORTING_AREA', axis=1).values  
x
Out[58]:
array([[2.48345e+05, 1.40000e+01, 5.60000e+01, ..., 1.01000e+02,
        2.30000e+01, 1.00000e+00],
       [2.48344e+05, 5.20000e+01, 1.75000e+02, ..., 9.96000e+02,
        1.80000e+01, 0.00000e+00],
       [2.48343e+05, 4.60000e+01, 2.09000e+02, ..., 1.01000e+02,
        2.40000e+01, 0.00000e+00],
       ...,
       [0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
        4.70000e+01, 1.00000e+00],
       [0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
        4.70000e+01, 1.00000e+00],
       [0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
        4.70000e+01, 1.00000e+00]])
In [59]:
x.shape
Out[59]:
(478406, 17)
In [60]:
y = df['UCR_PART'].values.reshape(-1,1) 
y
Out[60]:
array([[3],
       [2],
       [3],
       ...,
       [2],
       [2],
       [2]])

Horizontal split¶

In [61]:
#horizontal split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
In [62]:
print("Shape of X_train: ",x_train.shape)
print("Shape of X_test: ", x_test.shape)
print("Shape of y_train: ",y_train.shape)
print("Shape of y_test",y_test.shape)
Shape of X_train:  (334884, 17)
Shape of X_test:  (143522, 17)
Shape of y_train:  (334884, 1)
Shape of y_test (143522, 1)

Model Training¶

A.Linear Regression =¶

In [63]:
from sklearn.linear_model import LinearRegression
regressor_linear = LinearRegression()
regressor_linear.fit(x_train, y_train)
Out[63]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()

1.Intercept & Coefficient¶

In [64]:
import scipy.stats as stats
import statsmodels.formula.api as smf
#intercept
c = regressor_linear.intercept_
#coefficient
m = regressor_linear.coef_
print("Intercept = ",c)
print("Coefficent = ",m)
Intercept =  [-2.18454588e-10]
Coefficent =  [[-1.68214167e-18 -4.44267510e-15  3.93141482e-16  5.93437273e-15
  -1.91858755e-18  1.07905236e-13  1.34032166e-14 -6.75012205e-15
  -3.39157268e-15  1.00000000e+00 -7.93802697e-19  2.48844355e-14
   1.54933020e-14  1.85944950e-19  4.45645683e-16 -6.36305424e-17
  -1.29505729e-15]]
In [65]:
print(x_test)
[[0.00000e+00 6.60000e+01 2.26000e+02 ... 7.18000e+02 4.70000e+01
  1.00000e+00]
 [4.08720e+04 1.50000e+01 7.10000e+01 ... 1.71000e+02 3.70000e+01
  1.00000e+00]
 [7.33510e+04 5.50000e+01 2.06000e+02 ... 7.19000e+02 2.10000e+01
  1.00000e+00]
 ...
 [0.00000e+00 6.60000e+01 2.26000e+02 ... 7.18000e+02 4.70000e+01
  1.00000e+00]
 [2.37344e+05 4.60000e+01 2.09000e+02 ... 6.61000e+02 2.30000e+01
  1.00000e+00]
 [4.37880e+04 4.60000e+01 2.09000e+02 ... 6.38000e+02 3.70000e+01
  1.00000e+00]]
In [66]:
print(y_test)
[[2]
 [3]
 [3]
 ...
 [2]
 [3]
 [3]]

2.Predict¶

In [67]:
y_pred=regressor_linear.predict(x_test)
In [68]:
y_pred
Out[68]:
array([[2.],
       [3.],
       [3.],
       ...,
       [2.],
       [3.],
       [3.]])

3.Compare Actual & Predicted¶

In [69]:
y_test=y_test.ravel()
y_pred=y_pred.ravel()
In [70]:
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
Out[70]:
Actual Predicted
0 2 2.0
1 3 3.0
2 3 3.0
3 2 2.0
4 2 2.0
... ... ...
143517 3 3.0
143518 2 2.0
143519 2 2.0
143520 3 3.0
143521 3 3.0

143522 rows × 2 columns

In [71]:
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()

4.Training and Testing Score¶

In [72]:
print("Training score: ",regressor_linear.score(x_train,y_train))
print("Testing score: ",regressor_linear.score(x_test,y_test))
Training score:  1.0
Testing score:  1.0

5.Accuracy¶

In [73]:
print("accuracy: ",regressor_linear.score(x,y)*100)
accuracy:  100.0

6.R2-Score¶

In [74]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error

# Predicting Cross Validation Score the Test set results
cv_linear = cross_val_score(estimator = regressor_linear, X = x_train, y = y_train, cv = 10)

# Predicting R2 Score the Train set results
y_pred_linear_train = regressor_linear.predict(x_train)
r2_score_linear_train = r2_score(y_train, y_pred_linear_train)

# Predicting R2 Score the Test set results
y_pred_linear_test = regressor_linear.predict(x_test)
r2_score_linear_test = r2_score(y_test, y_pred_linear_test)

# Predicting RMSE the Test set results
rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_test)))
print("CV: ", cv_linear.mean())
print('R2_score (train): ', r2_score_linear_train)
print('R2_score (test): ', r2_score_linear_test)
print("RMSE: ", rmse_linear)
CV:  1.0
R2_score (train):  1.0
R2_score (test):  1.0
RMSE:  2.950017704202002e-13

B.Logistic Regression =¶

In [75]:
from sklearn.linear_model import LogisticRegression 
classifier = LogisticRegression(random_state = 0) 
classifier.fit(x_train, y_train) 
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[75]:
LogisticRegression(random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(random_state=0)

1.Predict¶

In [76]:
y_pred = classifier.predict(x_test)
print(y_pred)
[2 2 2 ... 2 2 2]
In [77]:
y
Out[77]:
array([[3],
       [2],
       [3],
       ...,
       [2],
       [2],
       [2]])

2.Compare Actual & Predicted¶

In [78]:
y_test=y_test.ravel()
y_pred=y_pred.ravel()
In [79]:
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
Out[79]:
Actual Predicted
0 2 2
1 3 2
2 3 2
3 2 2
4 2 2
... ... ...
143517 3 2
143518 2 2
143519 2 2
143520 3 2
143521 3 2

143522 rows × 2 columns

In [80]:
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()

3. Training and Testing Score¶

In [81]:
print("Training score: ",classifier.score(x_train,y_train))
print("Testing score: ",classifier.score(x_test,y_test))
Training score:  0.6862435947970044
Testing score:  0.685455888295871

4. Accuracy¶

In [82]:
print("accuracy: ",classifier.score(x,y)*100)
accuracy:  68.60072825173597

5. R2-Score¶

In [83]:
import warnings

from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error

# Predicting Cross Validation Score the Test set results
cv_classifier = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)

# Predicting R2 Score the Train set results
y_pred_classifier_train = classifier.predict(x_train)
r2_score_classifier_train = r2_score(y_train, y_pred_linear_train)

# Predicting R2 Score the Test set results
y_pred_classifier_test = classifier.predict(x_test)
r2_score_classifier_test = r2_score(y_test, y_pred_classifier_test)

# Predicting RMSE the Test set results
rmse_classifier = (np.sqrt(mean_squared_error(y_test, y_pred_classifier_test)))
print("CV: ", cv_classifier.mean())
print('R2_score (train): ', r2_score_classifier_train)
print('R2_score (test): ', r2_score_classifier_test)
print("RMSE: ", rmse_classifier)
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

CV:  0.6859270689813117
R2_score (train):  1.0
R2_score (test):  -0.23202899548422828
RMSE:  0.6061928699443855
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

C.Lasso =¶

In [84]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(x_train, y_train)
Out[84]:
Lasso(alpha=0.1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.1)

1.Predict¶

In [85]:
y_pred = lasso.predict(x_test)
print(y_pred)
[2.01144291 2.65454914 2.72569111 ... 2.01144291 2.69855938 2.6412486 ]
In [86]:
y
Out[86]:
array([[3],
       [2],
       [3],
       ...,
       [2],
       [2],
       [2]])

2.Compare Actual & Predicted¶

In [87]:
y_test=y_test.ravel()
y_pred=y_pred.ravel()
In [88]:
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
Out[88]:
Actual Predicted
0 2 2.011443
1 3 2.654549
2 3 2.725691
3 2 2.011443
4 2 1.995636
... ... ...
143517 3 2.646148
143518 2 2.030663
143519 2 2.011443
143520 3 2.698559
143521 3 2.641249

143522 rows × 2 columns

In [89]:
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()

3. Training and Testing Score¶

In [90]:
print("Training score: ",lasso.score(x_train,y_train))
print("Testing score: ",lasso.score(x_test,y_test))
Training score:  0.8847571362974157
Testing score:  0.8846696817277504

4. Accuracy¶

In [91]:
print("accuracy: ",lasso.score(x,y)*100)
accuracy:  88.47309296424297

5. R2-Score¶

In [92]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error

# Predicting Cross Validation Score the Test set results
cv_lasso = cross_val_score(estimator = lasso, X = x_train, y = y_train, cv = 10)

# Predicting R2 Score the Train set results
y_pred_lasso_train = lasso.predict(x_train)
r2_score_lasso_train = r2_score(y_train, y_pred_linear_train)

# Predicting R2 Score the Test set results
y_pred_lasso_test = lasso.predict(x_test)
r2_score_lasso_test = r2_score(y_test, y_pred_lasso_test)

# Predicting RMSE the Test set results
rmse_lasso = (np.sqrt(mean_squared_error(y_test, y_pred_lasso_test)))
print("CV: ", cv_lasso.mean())
print('R2_score (train): ', r2_score_lasso_train)
print('R2_score (test): ', r2_score_lasso_test)
print("RMSE: ", rmse_lasso)
CV:  0.8847480531974219
R2_score (train):  1.0
R2_score (test):  0.8846696817277504
RMSE:  0.18546933066976745

D.Decision Tree =¶

In [93]:
# Fitting the Decision Tree Regression Model to the dataset
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state = 0)
dt.fit(x_train, y_train)
Out[93]:
DecisionTreeRegressor(random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor(random_state=0)

1.Predict¶

In [94]:
y_pred_dt_test = dt.predict(x_test)
y_pred_dt_test
Out[94]:
array([2., 3., 3., ..., 2., 3., 3.])

2.Compare Actual & Predicted¶

In [95]:
y_test=y_test.ravel()
y_pred_dt_test =y_pred_dt_test.ravel()

actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_dt_test})
actual_pred
Out[95]:
Actual Predicted
0 2 2.0
1 3 3.0
2 3 3.0
3 2 2.0
4 2 2.0
... ... ...
143517 3 3.0
143518 2 2.0
143519 2 2.0
143520 3 3.0
143521 3 3.0

143522 rows × 2 columns

3.Training and Testing Score¶

In [96]:
print("Training score: ",dt.score(x_train,y_train))
print("Testing score: ",dt.score(x_test,y_test))
Training score:  1.0
Testing score:  1.0

4.Accuracy¶

In [97]:
print("accuracy: ",dt.score(x,y)*100)
accuracy:  100.0

5.R2-Score¶

In [98]:
from sklearn.metrics import r2_score

# Predicting Cross Validation Score
cv_dt = cross_val_score(estimator = dt, X = x_train, y = y_train, cv = 10)

# Predicting R2 Score the Train set results
y_pred_dt_train = dt.predict(x_train)
r2_score_dt_train = r2_score(y_train, y_pred_dt_train)

# Predicting R2 Score the Test set results
y_pred_dt_test = dt.predict(x_test)
r2_score_dt_test = r2_score(y_test, y_pred_dt_test)

# Predicting RMSE the Test set results
rmse_dt = (np.sqrt(mean_squared_error(y_test, y_pred_dt_test)))
print('CV: ', cv_dt.mean())
print('R2_score (train): ', r2_score_dt_train)
print('R2_score (test): ', r2_score_dt_test)
print("RMSE: ", rmse_dt)
CV:  1.0
R2_score (train):  1.0
R2_score (test):  1.0
RMSE:  0.0

E.Random Forest =¶

In [100]:
# Fitting the Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 500, random_state = 0)
rf.fit(x_train, y_train.ravel())
Out[100]:
RandomForestRegressor(n_estimators=500, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(n_estimators=500, random_state=0)

1.Predict¶

In [101]:
y_pred_rf_test = rf.predict(x_test)
y_pred_rf_test
Out[101]:
array([2., 3., 3., ..., 2., 3., 3.])

2.Compare Actual & Predicted¶

In [102]:
y_test=y_test.ravel()
y_pred_dt_test =y_pred_rf_test.ravel()

actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_rf_test})
actual_pred
Out[102]:
Actual Predicted
0 2 2.0
1 3 3.0
2 3 3.0
3 2 2.0
4 2 2.0
... ... ...
143517 3 3.0
143518 2 2.0
143519 2 2.0
143520 3 3.0
143521 3 3.0

143522 rows × 2 columns

3.Training and Testing Score¶

In [103]:
print("Training score: ",rf.score(x_train,y_train))
print("Testing score: ",rf.score(x_test,y_test))
Training score:  1.0
Testing score:  1.0

4.Accuracy¶

In [104]:
print("accuracy: ",rf.score(x,y)*100)
accuracy:  100.0

5.R2-Score¶

In [ ]:
from sklearn.metrics import r2_score

# Predicting Cross Validation Score
cv_rf = cross_val_score(estimator = rf, X = x_train, y = y_train.ravel(), cv = 10)

# Predicting R2 Score the Train set results
y_pred_rf_train = rf.predict(x_train)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)

# Predicting R2 Score the Test set results
y_pred_rf_test = rf.predict(x_test)
r2_score_rf_test = r2_score(y_test, y_pred_rf_test)

# Predicting RMSE the Test set results
rmse_rf = (np.sqrt(mean_squared_error(y_test, y_pred_rf_test)))
print('CV: ', cv_rf.mean())
print('R2_score (train): ', r2_score_rf_train)
print('R2_score (test): ', r2_score_rf_test)
print("RMSE: ", rmse_rf)

F.Support Vector Regression =¶

In [ ]:
# Importing the required libraries
from sklearn.svm import SVR

# Creating an instance of the model
svr=SVR()

# Fitting the model to the training data
svr.fit(x_train,y_train)

1.Predict¶

In [ ]:
y_pred_svr_test = svr.predict(x_test)
y_pred_svr_test

2.Compare Actual & Predicted¶

In [ ]:
y_test=y_test.ravel()
y_pred_svr_test =y_pred_svr_test.ravel()

actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_svr_test})
actual_pred

3.Training and Testing Score¶

In [ ]:
print("Training score: ",svr.score(x_train,y_train))
print("Testing score: ",svr.score(x_test,y_test))

4.Accuracy¶

In [ ]:
print("accuracy: ",svr.score(x,y)*100)

5.R2-Score¶

In [ ]:
from sklearn.metrics import r2_score

# Predicting Cross Validation Score
cv_svr = cross_val_score(estimator = svr, X = x_train, y = y_train.ravel(), cv = 10)

# Predicting R2 Score the Train set results
y_pred_svr_train = svr.predict(x_train)
r2_score_svr_train = r2_score(y_train, y_pred_svr_train)

# Predicting R2 Score the Test set results
y_pred_svr_test = svr.predict(x_test)
r2_score_svr_test = r2_score(y_test, y_pred_svr_test)

# Predicting RMSE the Test set results
rmse_svr = (np.sqrt(mean_squared_error(y_test, y_pred_svr_test)))
print('CV: ', cv_svr.mean())
print('R2_score (train): ', r2_score_svr_train)
print('R2_score (test): ', r2_score_svr_test)
print("RMSE: ", rmse_svr)

G. Linear Regression in Deep Learning by using TensorFlow =¶

In [ ]:
import tensorflow as tf
import numpy as np
In [ ]:
# Define your model
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(1)
])
In [ ]:
# Compile your model
model.compile(optimizer='adam', loss='mse')
In [ ]:
# Make predictions on the test set
y_pred = model.predict(x_test)
In [ ]:
# Make predictions on the test set
y_pred = model.predict(x_test)

Conclusion:¶

The linear regression model can demonstrates a strong accurately predict crime in which area as based on the number of crime areas.For example,a victim made crimes in 9 areas/day is predicted to score approximately 90.467% in the report . A successful model for predicting crime area based on reporting area with a high accuracy of 100.0%¶